Applied Machine Learning: Summative Code 1/2 (Exploration)

Candidate Number: 1047904

07 May, 2021

**README:**

  1. Run notebook “CODE_Explore_1047904_Applied Machine Learning_Summative.ipynb” (this notebook) first to write out the pre-processed data and the associated dtype dictionary into the FFChallenge_v5 folder.
  2. Run notebook “CODE_Predict_1047904_Applied Machine Learning_Summative.ipynb” (the other notebook) second to pick up the files previously written out by the first notebook. This will allow you to re-run all prediction experiments.
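
The hand-off between the two notebooks can be illustrated with a minimal sketch. It assumes the pre-processed data is stored as a CSV and the dtype dictionary as a JSON file; the file names `preprocessed.csv` and `dtypes.json`, the helper names, and the toy columns are hypothetical and are not taken from the notebooks themselves.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def write_preprocessed(df, out_dir):
    """Write the data as CSV plus a column -> dtype-name dictionary as JSON."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    df.to_csv(out_dir / "preprocessed.csv", index=False)  # hypothetical file name
    dtypes = {col: str(dtype) for col, dtype in df.dtypes.items()}
    with open(out_dir / "dtypes.json", "w") as fh:  # hypothetical file name
        json.dump(dtypes, fh)


def read_preprocessed(out_dir):
    """Read the CSV back, re-applying the stored dtypes."""
    out_dir = Path(out_dir)
    with open(out_dir / "dtypes.json") as fh:
        dtypes = json.load(fh)
    return pd.read_csv(out_dir / "preprocessed.csv", dtype=dtypes)


# Toy data standing in for the real pre-processed features.
df = pd.DataFrame(
    {
        "income": [10.5, 20.0, 7.25],
        "household_size": [3, 4, 2],
        "region": pd.Series(["a", "b", "a"], dtype="category"),
    }
)

with tempfile.TemporaryDirectory() as tmp:
    write_preprocessed(df, tmp)
    restored = read_preprocessed(tmp)

original_dtypes = {col: str(dtype) for col, dtype in df.dtypes.items()}
restored_dtypes = {col: str(dtype) for col, dtype in restored.dtypes.items()}
```

Storing the dtypes alongside the data matters because a plain CSV does not record which columns were categorical, so the second notebook would otherwise have to re-infer them.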

Library Imports

Functions

Load Datasets & complementary resources

Data Pre-Processing

Removal of invariant columns

Mapping columns to categorical or continuous

Conflicting Value Analysis

Missing Value Analysis (Before Column Pruning)

Addressing instances not covered by FFC data dict

Pre-Processing - Fixing erroneous columns

Pre-Processing - Rectifying dtype conflicts & removing high NA columns
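
As an illustration of removing high NA columns, the sketch below drops every column whose share of missing values exceeds a cut-off. The function name, the `max_na_fraction` parameter, and the threshold value are assumptions made for this example; the cut-off actually used in this notebook is not stated here.

```python
import numpy as np
import pandas as pd


def drop_high_na_columns(df, max_na_fraction=0.8):
    """Keep only columns whose fraction of missing values is <= max_na_fraction."""
    na_fraction = df.isna().mean()  # per-column share of NaN values
    keep = na_fraction[na_fraction <= max_na_fraction].index
    return df.loc[:, keep]


# Toy data: one column is 75% missing, one is 50% missing, one is complete.
toy = pd.DataFrame(
    {
        "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
        "complete": [1, 2, 3, 4],
        "half_missing": [1.0, np.nan, 3.0, np.nan],
    }
)

# Assumed threshold of 0.5 for the toy example only.
pruned = drop_high_na_columns(toy, max_na_fraction=0.5)
```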

Pre-Processing - Removing invariant continuous columns

Pre-Processing - Removing invariant categorical columns
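
A minimal sketch of invariant-column removal follows. It treats a column as invariant when it has at most one distinct non-missing value and applies the same rule to continuous and categorical columns; this definition and the helper name are assumptions for illustration, and the notebook may use a different criterion (for example a variance threshold for continuous features).

```python
import pandas as pd


def drop_invariant_columns(df):
    """Drop columns that hold at most one distinct non-missing value."""
    n_unique = df.nunique(dropna=True)  # distinct non-missing values per column
    return df.loc[:, n_unique > 1]


# Toy data: two constant columns (one numeric, one categorical) and one varying.
features = pd.DataFrame(
    {
        "constant_num": [1.0, 1.0, 1.0],
        "constant_cat": pd.Series(["a", "a", "a"], dtype="category"),
        "varying": [1, 2, 3],
    }
)

informative = drop_invariant_columns(features)
```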

FIGURE - Removal of high NA columns

FIGURE - Plot Continuous & Categorical Feature Invariances

Create input feature df for train test split

Write out pre-processed data and dtype dictionary

FIGURE - Missing Label Distribution (Before vs. After Column Pruning)

FIGURE - Overview of Feature Engineering before train-test split